Cory Whitney
Andrew MacDonald @polesasunder
Keep it tidy
Use '#' to annotate code; commented lines are not run
If not R Markdown, then at least use '----' or '####' to mark code sections
# Section 1 ----
# Section 2 ####
# Section 3 ####
The sections then appear as a table of contents in the outline (upper right of the RStudio source pane)
Check your R version
version
The easiest way to get the libraries for today is to install the whole tidyverse and then load it:
install.packages("tidyverse") # only needed once
library(tidyverse)
Learn about the tidyverse with browseVignettes():
browseVignettes(package = "tidyverse")
Three things make a dataset tidy:
1. Each variable forms a column
2. Each observation forms a row
3. Each value has its own cell
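A minimal sketch of untidy vs. tidy data, using tidyr's pivot_longer() (the country/year/cases columns here are invented for illustration):

```r
library(tidyr)

# Untidy: one variable (year) is spread across the column headers
untidy <- data.frame(
  country = c("A", "B"),
  `2019` = c(10, 20),
  `2020` = c(12, 25),
  check.names = FALSE
)

# Tidy: each variable a column, each observation a row
tidy <- pivot_longer(untidy,
                     cols = c("2019", "2020"),
                     names_to = "year",
                     values_to = "cases")
```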
The format of dplyr
All dplyr functions take a data frame as their first argument
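Because the data frame always comes first, the same call works standalone or supplied by a pipe; a minimal sketch using the mtcars data that ships with base R:

```r
library(dplyr)

# data frame first, then the columns to keep
select(mtcars, mpg, cyl)

# equivalent, with the data frame supplied by the pipe
mtcars %>% select(mpg, cyl)
```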
Load data
participants_data <- read.csv("participants_data.csv")
Using dplyr
library(dplyr)
and others we need today
library(knitr)
library(tidyr)
library(magrittr)
Roger Peng
genomicsclass.github.io/book/pages/dplyr_tutorial
Subsetting
Select
aca_work_filter <- select(participants_data, academic_parents, working_hours_per_day)
Subsetting
Select
non_aca_work_filter <- select(participants_data, -academic_parents, -working_hours_per_day)
Subsetting
Filter
work_filter <- filter(participants_data, working_hours_per_day > 10)
Subsetting
Filter
work_name_filter <- filter(participants_data, working_hours_per_day > 10 & letters_in_first_name > 6)
Rename
participants_data <- rename(participants_data, name_length = letters_in_first_name)
Rename
participants_data <- rename(participants_data,
daily_labor = working_hours_per_day)
Mutate
participants_data <- mutate(participants_data, labor_mean = daily_labor * mean(daily_labor))
Mutate
Create a commute category
participants_data <- mutate(participants_data, commute = ifelse(km_home_to_zef > 10, "commuter", "local"))
Group: group data by commuters and non-commuters
commuter_data <- group_by(participants_data, commute)
Summarize: get a summary of email response times and name lengths per group
commuter_summary <- summarize(commuter_data, mean(days_to_email_response), median(name_length))
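The summary columns above get awkward auto-generated names; naming them explicitly (a sketch, same grouped data as above) keeps the result easier to work with:

```r
commuter_summary <- summarize(commuter_data,
                              mean_response = mean(days_to_email_response),
                              median_name   = median(name_length))
```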
Pipeline %>%
pipe_data <- participants_data %>%
  mutate(commute = ifelse(km_home_to_zef > 10,
                          "commuter", "local")) %>%
  group_by(commute) %>%
  summarize(mean(days_to_email_response),
            median(name_length),
            max(years_of_study)) %>%
  as.data.frame()
Pipeline %>%
Work on your own with a pipeline %>%
Make your own query with dplyr and magrittr
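One possible query to get started (a sketch only, assuming the columns renamed above; the threshold of 5 hours is arbitrary):

```r
library(dplyr)
library(magrittr)

participants_data %>%
  filter(daily_labor > 5) %>%            # keep long working days
  group_by(academic_parents) %>%         # compare by parental background
  summarize(mean_labor = mean(daily_labor)) %>%
  as.data.frame()
```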
library(purrr)
purrr Cheatsheet
Use purrr to solve: split a data frame into pieces, fit a model to each piece, compute the summary, then extract the R2.
Use purrr
library(purrr)
participants_data_regression <- participants_data %>%
  split(.$batch) %>%                          # split() is from base R
  map(~ lm(days_to_email_response ~ daily_labor, data = .)) %>%
  map(summary) %>%
  map_dbl("r.squared")
Work through tasks on the diamonds data: in long format with base R, and in short format with a magrittr pipeline:
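As a starting point, one such task might look like this in both styles (a sketch; the diamonds data ships with ggplot2, and the carat cutoff of 1 is arbitrary):

```r
library(ggplot2)   # provides the diamonds data
library(dplyr)
library(magrittr)

# Long format, base R: subset, then aggregate
diamonds_sub <- diamonds[diamonds$carat > 1, ]
price_by_cut <- aggregate(price ~ cut, data = diamonds_sub, FUN = mean)

# Short format, magrittr pipeline: same result in one chain
price_by_cut <- diamonds %>%
  filter(carat > 1) %>%
  group_by(cut) %>%
  summarize(mean_price = mean(price)) %>%
  as.data.frame()
```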